An Open source Implementation of ITU-T Recommendation P.808 with Validation
The ITU-T Recommendation P.808 provides a crowdsourcing approach for
conducting a subjective assessment of speech quality using the Absolute
Category Rating (ACR) method. We provide an open-source implementation of the
ITU-T Rec. P.808 that runs on the Amazon Mechanical Turk platform. We extended
our implementation to include Degradation Category Ratings (DCR) and Comparison
Category Ratings (CCR) test methods. We also significantly speed up the test
process by integrating the participant qualification step into the main rating
task compared to a two-stage qualification and rating solution. We provide
program scripts for creating and executing the subjective test, and for
cleansing and analyzing the collected answers, to avoid operational errors. To validate
the implementation, we compare the Mean Opinion Scores (MOS) collected through
our implementation with MOS values from a standard laboratory experiment
conducted based on the ITU-T Rec. P.800. We also evaluate the reproducibility
of the result of the subjective speech quality assessment through crowdsourcing
using our implementation. Finally, we quantify the impact of parts of the
system designed to improve the reliability: environmental tests, gold and
trapping questions, rating patterns, and a headset usage test.
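As a rough illustration of the MOS comparison described above, the following minimal sketch aggregates ACR ratings into a Mean Opinion Score per condition with a normal-approximation 95% confidence interval. It is not taken from the published scripts: the function name `mos_per_condition`, the `(condition, score)` input layout, and the choice of confidence interval are assumptions made for this example.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def mos_per_condition(ratings):
    """Aggregate ACR ratings into (MOS, 95% CI half-width) per condition.

    `ratings` is an iterable of (condition, score) pairs, where score is an
    ACR rating on the 1..5 scale. The input layout is hypothetical.
    """
    by_condition = defaultdict(list)
    for condition, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"ACR score out of range: {score}")
        by_condition[condition].append(score)

    result = {}
    for condition, scores in by_condition.items():
        n = len(scores)
        # Normal-approximation interval; undefined for one rating, so use 0.0.
        ci = 1.96 * stdev(scores) / math.sqrt(n) if n > 1 else 0.0
        result[condition] = (mean(scores), ci)
    return result


if __name__ == "__main__":
    # Illustrative ratings only, not data from the paper.
    votes = [("clean", 5), ("clean", 4), ("noisy", 2), ("noisy", 3)]
    for name, (mos, ci) in mos_per_condition(votes).items():
        print(f"{name}: MOS = {mos:.2f} +/- {ci:.2f}")
```

Per-condition MOS values computed this way from crowdsourced and laboratory ratings could then be compared, for example by correlation.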
Macular Bioaccelerometers on Earth and in Space
Space flight offers the opportunity to study linear bioaccelerometers (vestibular maculas) in the virtual absence of a primary stimulus, gravitational acceleration. Macular research in space is particularly important to NASA because the bioaccelerometers are proving to be weighted neural networks in which information is distributed for parallel processing. Neural networks are plastic and highly adaptive to new environments. Combined morphological-physiological studies of maculas fixed in space and following flight should reveal macular adaptive responses to microgravity, and their time-course. Ground-based research, already begun, using computer-assisted, 3-dimensional reconstruction of macular terminal fields will lead to development of computer models of functioning maculas. This research should continue in conjunction with physiological studies, including work with multichannel electrodes. The results of such a combined effort could usher in a new era in understanding vestibular function on Earth and in space. They can also provide a rational basis for counter-measures to space motion sickness, which may prove troublesome as space voyagers encounter new gravitational fields on planets, or must re-adapt to 1 g upon return to Earth.
Trustworthy Experimentation Under Telemetry Loss
Failure to accurately measure the outcomes of an experiment can lead to bias
and incorrect conclusions. Online controlled experiments (aka AB tests) are
increasingly being used to make decisions to improve websites as well as mobile
and desktop applications. We argue that loss of telemetry data (during upload
or post-processing) can skew the results of experiments, leading to loss of
statistical power and inaccurate or erroneous conclusions. By systematically
investigating the causes of telemetry loss, we argue that it is not practical
to entirely eliminate it. Consequently, experimentation systems need to be
robust to its effects. Furthermore, we note that it is nontrivial to measure
the absolute level of telemetry loss in an experimentation system. In this
paper, we take a top-down approach towards solving this problem. We motivate
the impact of loss qualitatively using experiments in real applications
deployed at scale, and formalize the problem by presenting a theoretical
breakdown of the bias introduced by loss. Based on this foundation, we present
a general framework for quantitatively evaluating the impact of telemetry loss,
and present two solutions to measure the absolute levels of loss. This
framework is used by well-known applications at Microsoft, with millions of
users and billions of sessions. These general principles can be adopted by any
application to improve the overall trustworthiness of experimentation and
data-driven decision making.
Comment: Proceedings of the 27th ACM International Conference on Information
and Knowledge Management, October 201
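To make the claim about skewed results concrete, here is a small worked sketch of how uneven telemetry loss can create a spurious treatment effect. It is an illustration only, not the framework from the paper: the function `observed_mean`, the two session segments, and all loss rates are hypothetical values chosen for the example.

```python
def observed_mean(segments):
    """Mean metric over sessions whose telemetry survived.

    `segments` is a list of (n_sessions, mean_metric, loss_rate) tuples.
    All three fields are hypothetical inputs for this illustration.
    """
    kept = [(n * (1.0 - loss), metric) for n, metric, loss in segments]
    total_kept = sum(count for count, _ in kept)
    return sum(count * metric for count, metric in kept) / total_kept


# Both arms have the same true population: half short sessions (metric 1.0)
# and half long sessions (metric 3.0), so the true mean is 2.0 in each arm.
control = [(1000, 1.0, 0.05), (1000, 3.0, 0.05)]
# Assumption: the treatment loses more telemetry from long sessions.
treatment = [(1000, 1.0, 0.05), (1000, 3.0, 0.20)]

if __name__ == "__main__":
    delta = observed_mean(treatment) - observed_mean(control)
    print(f"observed control mean:   {observed_mean(control):.4f}")
    print(f"observed treatment mean: {observed_mean(treatment):.4f}")
    print(f"spurious delta:          {delta:.4f}")
```

Uniform loss leaves the control mean unbiased and only shrinks the sample, which costs statistical power. Loss that differs by segment shifts the observed treatment mean even though the true effect is zero.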
Multi-dimensional Speech Quality Assessment in Crowdsourcing
Subjective speech quality assessment is the gold standard for evaluating
speech enhancement processing and telecommunication systems. The commonly used
standard ITU-T Rec. P.800 defines how to measure speech quality in lab
environments, and ITU-T Rec. P.808 extended it for crowdsourcing. ITU-T Rec.
P.835 extends P.800 to measure the quality of speech in the presence of noise.
ITU-T Rec. P.804 targets the conversation test and introduces perceptual speech
quality dimensions which are measured during the listening phase of the
conversation. The perceptual dimensions are noisiness, coloration,
discontinuity, and loudness. We create a crowdsourcing implementation of a
multi-dimensional subjective test following the scales from P.804 and extend it
to include reverberation, the speech signal, and overall quality. We show the
tool is both accurate and reproducible. The tool has been used in the ICASSP
2023 Speech Signal Improvement challenge and we show the utility of these
speech quality dimensions in this challenge. The tool will be publicly
available as open-source at https://github.com/microsoft/P.808
VCD: A Video Conferencing Dataset for Video Compression
Commonly used datasets for evaluating video codecs are all very high quality
and not representative of video typically used in video conferencing scenarios.
We present the Video Conferencing Dataset (VCD) for evaluating video codecs for
real-time communication, the first such dataset focused on video conferencing.
VCD includes a wide variety of camera qualities and spatial and temporal
information. It includes both desktop and mobile scenarios and two types of
video background processing. We report the compression efficiency of H.264,
H.265, H.266, and AV1 in low-delay settings on VCD and compare it with the
non-video conferencing datasets UVG, MCL-JCV, and HEVC. The results show the
source quality and the scenarios have a significant effect on the compression
efficiency of all the codecs. VCD enables the evaluation and tuning of codecs
for this important scenario. The VCD is publicly available as an open-source
dataset at https://github.com/microsoft/VCD
Real-time Bandwidth Estimation from Offline Expert Demonstrations
In this work, we tackle the problem of bandwidth estimation (BWE) for
real-time communication systems; however, in contrast to previous works, we
leverage the vast efforts of prior heuristic-based BWE methods and synergize
these approaches with deep learning-based techniques. Our work addresses
challenges in generalizing to unseen network dynamics and extracting rich
representations from prior experience, two key challenges in integrating
data-driven bandwidth estimators into real-time systems. To that end, we
propose Merlin, the first purely offline, data-driven solution to BWE that
harnesses prior heuristic-based methods to extract an expert BWE policy.
Through a series of experiments, we demonstrate that Merlin surpasses
state-of-the-art heuristic-based and deep learning-based bandwidth estimators
in terms of objective quality of experience metrics while generalizing beyond
the offline world to in-the-wild network deployments where Merlin achieves a
42.85% and 12.8% reduction in packet loss and delay, respectively, when
compared against WebRTC in inter-continental videoconferencing calls. We hope
that Merlin's offline-oriented design fosters new strategies for real-time
network control.
Meeting effectiveness and inclusiveness: large-scale measurement, identification of key features, and prediction in real-world remote meetings
Workplace meetings are vital to organizational collaboration, yet relatively
little progress has been made toward measuring meeting effectiveness and
inclusiveness at scale. The recent rise in remote and hybrid meetings
represents an opportunity to do so via computer-mediated communication (CMC)
systems. Here, we share the results of an effective and inclusive meetings
survey embedded within a CMC system in a diverse set of companies and
organizations. We correlate the survey results with objective metrics available
from the CMC system to identify the generalizable attributes that characterize
perceived effectiveness and inclusiveness in meetings. Additionally, we explore
a predictive model of meeting effectiveness and inclusiveness based solely on
objective meeting attributes. Lastly, we show challenges and discuss solutions
around the subjective measurement of meeting experiences. To our knowledge,
this is the largest data-driven study conducted after the pandemic peak to
measure, understand, and predict effectiveness and inclusiveness in real-world
meetings at an organizational scale.